| Idan Chen project - Personal Key Indicators of Heart Disease


| Table Of Contents:


| Imports


| Data preparation


| Missing values or outliers

Tip: zero NaN values = 😀😀 = makes my life much easier.
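A minimal sketch of how such a missing-value check could look with pandas. The toy DataFrame and the names `df`, `nan_per_column` and `df_clean` are made up for illustration (one NaN is planted on purpose); this is not the notebook's original code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the heart disease data (values are made up, one NaN planted).
df = pd.DataFrame({
    "HeartDisease": ["No", "Yes", "No", "No"],
    "BMI": [16.6, 20.3, np.nan, 24.2],
    "SleepTime": [5.0, 7.0, 8.0, 6.0],
})

# Number of missing values per column; all zeros means nothing to clean.
nan_per_column = df.isna().sum()
print(nan_per_column)

# Drop every row that contains at least one missing value.
df_clean = df.dropna()
```

If every entry of `nan_per_column` is zero, the `dropna()` step changes nothing and can be skipped.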

| Basic visualisations for analysis

I chose to add basic visualisations to show the problems in the data before building the model ...

I'll need to use dummy variables for the categorical variables

and deal with other issues, like the big class imbalance that would destroy my model

histplot of SleepTime divided into 24 bins (the number of hours in a day)

There are 551 people who sleep 1 hour per day!

And 30 people who sleep 24 hours per day! (I think they are dead :/)

kdeplot of BMI divided by Heart Disease Label

Explanation: I added a BMI reference image to check whether the BMI distribution I got looks reasonable

Function that takes a df and a column name and plots a countplot with bar labels
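A minimal sketch of such a helper, assuming seaborn and matplotlib 3.4 or newer (for `Axes.bar_label`). The function name `countplot_with_labels` and the toy `ages` data are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def countplot_with_labels(df, column):
    """Draw a count plot of df[column] and write each bar's count above it."""
    fig, ax = plt.subplots(figsize=(10, 4))
    order = sorted(df[column].unique())  # stable, alphabetical bar order
    sns.countplot(data=df, x=column, order=order, ax=ax)
    for container in ax.containers:  # one container per group of bars
        ax.bar_label(container)
    ax.set_title(f"Count of {column}")
    return ax


# Toy data with three age categories (made up).
ages = pd.DataFrame(
    {"AgeCategory": ["18-24", "25-29", "25-29", "80 or older", "18-24", "25-29"]}
)
ax = countplot_with_labels(ages, "AgeCategory")
```

Sorting the categories alphabetically happens to give a sensible order for labels such as "18-24" and "80 or older"; other columns may need an explicit order.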

count plot of Age Category


| Find the best columns for ML on HeartDisease based on their Corr


After research using catplot and pandas corr, I found that the most suitable columns are:

By the way - it's very interesting to find that Sleep Time shows almost no correlation with Heart Disease

Danger: I found the most suitable columns, but the correlation is not great at all (the best are Stroke and DiffWalking with 0.2 corr)
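A sketch of how columns could be ranked by their correlation with the target using pandas `corr`. The six toy rows are invented, so the resulting numbers are unrelated to the 0.2 reported above; the Yes/No to 1/0 mapping is an assumption about how the columns were encoded.

```python
import pandas as pd

# Toy rows (made up) with Yes/No columns and one numeric column.
df = pd.DataFrame({
    "HeartDisease": ["No", "Yes", "No", "Yes", "No", "No"],
    "Stroke":       ["No", "Yes", "No", "No", "No", "No"],
    "DiffWalking":  ["No", "Yes", "No", "Yes", "Yes", "No"],
    "SleepTime":    [7, 6, 8, 7, 7, 6],
})

# Encode Yes/No as 1/0 so Pearson correlation can be computed.
encoded = df.copy()
for col in ["HeartDisease", "Stroke", "DiffWalking"]:
    encoded[col] = encoded[col].map({"Yes": 1, "No": 0})

# Absolute correlation of every feature with the target, strongest first.
corr_with_target = (
    encoded.corr()["HeartDisease"]
    .drop("HeartDisease")
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```

A low correlation only says there is no strong linear relationship; it does not prove that a column has no effect.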

| Dummy variables for categorical variables
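As an illustration, dummy variables can be created with `pd.get_dummies`. The toy frame below is made up, and `drop_first=True` is my assumption (it avoids one redundant column per category); the notebook may have kept all levels.

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column (made up).
df = pd.DataFrame({
    "Sex": ["Female", "Male", "Female"],
    "Race": ["White", "Black", "Asian"],
    "BMI": [16.6, 20.3, 26.6],
})

# One 0/1 column per category level; the first level of each column is dropped.
dummies = pd.get_dummies(df, columns=["Sex", "Race"], drop_first=True, dtype=int)
print(dummies)
```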


| Balance the data

count plot of Heart Disease

There are 10X more people without Heart Disease than people with Heart Disease:

we need to handle the dummy variables and the imbalance in the Y target ('HeartDisease' column)

There are two methods: under sampling and over sampling

I will use under sampling because I have a lot of data (27373 rows with a Yes target),

so I prefer to reduce the number of people without Heart Disease to this number

rather than use over sampling to create more samples of people with Heart Disease.

If I had only a small number of people with Heart Disease, I would use over sampling
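A sketch of random under sampling done with plain pandas. This is a swapped-in technique for illustration: the notebook may have used a dedicated library instead, and the toy data (10 positive rows, 100 negative rows) is made up.

```python
import pandas as pd

# Toy imbalanced data: 10 rows with heart disease, 100 without (made up).
df = pd.DataFrame({
    "HeartDisease": [1] * 10 + [0] * 100,
    "BMI": range(110),
})

minority = df[df["HeartDisease"] == 1]
majority = df[df["HeartDisease"] == 0]

# Randomly keep only as many majority rows as there are minority rows.
majority_down = majority.sample(n=len(minority), random_state=42)

# Combine both groups and shuffle so the classes are mixed.
balanced = (
    pd.concat([minority, majority_down])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
print(balanced["HeartDisease"].value_counts())
```

Under sampling throws data away, which is acceptable here only because the minority class is still large.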

Success: We finished preparing the data and can now move on to Model Training

see the data after the preparation


| Model Training

Important functions for model training
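A hedged sketch of what a shared training helper might look like with scikit-learn. The function name `fit_and_score`, the synthetic data from `make_classification` and the 75/25 split are all assumptions for illustration, not the notebook's actual functions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split


def fit_and_score(model, X_train, X_test, y_train, y_test):
    """Fit the model and return its main classification scores on the test set."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]  # probability of class 1
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_proba),
    }


# Synthetic binary classification data standing in for the prepared dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
scores = fit_and_score(LogisticRegression(max_iter=1000), X_train, X_test, y_train, y_test)
print(scores)
```

Because every scikit-learn classifier used here exposes `fit`, `predict` and `predict_proba`, the same helper can be reused for each model.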


| Logistic Regression


| DecisionTreeClassifier


| KNeighborsClassifier


| Comparison


| SUMMARY

When I first started the project I looked for interesting data. I searched the Kaggle website and found this dataset: https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease

This data looked interesting to me because it has a lot of comments and likes, and when I looked into its columns I found a lot of indicators of heart disease: Age, BMI, Stroke, Sex and even some indicators I had not thought about: DiffWalking, kidney disease, Race, etc.

First of all, I thought about my business need: what I'm trying to understand is how adults can improve their health status.

For the ML, my y target is "heart disease" and I predict the result (0/1) based on my features. Once I know the weight of each feature, I will understand how adults can improve their chances of not getting heart disease.

After I looked at the data, I started to clean it and delete all the null values. After that, I created basic visualisations to understand the patterns in the data.

When I finished this part, I created a heatmap and a barh plot of the correlations with the y target (heart disease column) and figured out that DiffWalking, Stroke, Diabetic, physical health and kidney disease are the top 5 columns related to heart disease.

In the next part, I created dummy variables for the ML training part and then balanced the data: the ratio of people with heart disease to people without it was 1:10, so I applied random under sampling to reduce the number of people without heart disease and bring the ratio to 1:1.

I used 3 types of ML models: Logistic Regression, DecisionTreeClassifier and KNeighborsClassifier.

Then I created a plot to see the differences in the results between them, using

accuracy_score, precision_score, recall_score, f1_score, predict_proba, roc_curve and roc_auc_score
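A sketch of how the three models could be compared side by side in one table and one bar plot. The data is synthetic, only three of the scores are shown to keep it short, and the default hyperparameters are an assumption, so the ranking here says nothing about the real dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic balanced data standing in for the prepared dataset.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "KNeighborsClassifier": KNeighborsClassifier(),
}

rows = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]  # probability of class 1
    rows[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_proba),
    }

# One row per model, one column per score.
comparison = pd.DataFrame(rows).T
print(comparison)

# Grouped bar chart: each model gets one bar per score.
ax = comparison.plot.bar(rot=0, ylim=(0, 1), title="Model comparison")
```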

When I look at the final plot that compares all the scores, it seems to me that Logistic Regression is the best fit for this dataset. It won on every score except the precision score.

For my business question: I would tell people who want to decrease their chances of getting heart disease to:


Success: Finished the Heart Disease Project! Thank you very much for reading